Documented here are the various microarray file formats that carry the comprehensive SNP test "RAW data". These files are generally the result of a DNA Microarray Testing lab procedure. They are simplifications of the "RAW" VCF Sequencing File Format already familiar to many in the genetics community; in computer-industry terms, they are simply TSV (or CSV) format files. The vast majority of the content covers the autosomes (in FTDNA's case, sometimes only the autosomes), which is why some call them autosomal tests and formats. That is a misnomer, because the laboratory test and its results include SNPs from all the DNA: the allosomes and mtDNA as well. The autosomal content, being by far the largest part, is simply what tends to be used most, in the unique matching-segment mechanism between testers.
Note that the microarray file formats are somewhat independent of their content and of the company generating them. That is, they do not specify which company, or which version of that company's test, the file came from, nor necessarily which reference genome model was used. This makes the development of automated analysis tools more difficult: only with developed heuristics is a tool able to determine the source of a file and its content.
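The kind of heuristic meant here can be sketched in Python. This is a minimal illustration only, assuming the header phrases seen in the sample files on this page; the function name `guess_vendor` and the labels it returns are hypothetical, and real files vary more than this covers.

```python
def guess_vendor(first_lines):
    """Guess the source company from the first few lines of a RAW file.

    Illustrative heuristic: the marker strings come from the sample
    headers on this page and will not cover every vendor or version.
    """
    head = "\n".join(first_lines)
    if "23andMe" in head:
        return "23andMe"
    if "AncestryDNA" in head:
        return "Ancestry"
    if "Living DNA" in head:
        return "LivingDNA"
    # A file whose very first line is the RSID title row has no descriptive
    # header at all, so the vendor cannot be read from the header.
    if first_lines and first_lines[0].upper().startswith("RSID,"):
        return "headerless-csv"
    return "unknown"
```

A headerless CSV still needs other clues, such as the line count or the mix of identifier styles, to narrow down the company.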
Although more than just autosomes are included in the file, the microarray file formats are not the method of documenting specific test results from most yDNA, mtDNA or NGS targeted tests. The latter use the Sequencing File Formats well known to geneticists. With that said, these microarray file formats include xDNA, yDNA and mtDNA SNP values as provided by most of the testing companies. Check the companion page SNP Databases to learn more about the content itself (for example, how an rsID compares to a common SNP name such as P312).
Some common features exist between the formats (and even the VCF standard). For example, all use a form of Tab-Separated Value (TSV) columnar text format, something that grew out of the simple textual table and then follow-on spreadsheet processing. The extended Comma-Separated Value (CSV) form is a more robust format that is still textual but not really directly readable by a human. It is a superior format for capturing a variety of information but must usually be computer processed. Spreadsheets support both TSV and CSV forms in the files they read and write. Another common feature of the microarray file formats is that most have one or more lines of header information. This header is free form, but each line starts with the hash ("#") symbol to distinguish its content. The hash was introduced in the 1970s by the UNIX shell to indicate a comment line in a script file that should not be processed. These hash lines are not defined as part of the spreadsheet formats and, as such, make the microarray file formats described here a little more unique and not 100% compatible with spreadsheet programs.
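Because spreadsheets do not understand the hash header, a reader has to strip it first. The following Python sketch shows one way to do that; it assumes the delimiter is a tab when the first data line contains one and a comma otherwise, and the function name `split_header_and_rows` is hypothetical.

```python
import csv

def split_header_and_rows(text):
    """Separate '#' header lines from data rows and split the rows into fields.

    Assumption: tab-delimited if the first data line contains a tab,
    comma-delimited otherwise. Quoted CSV fields are unquoted by csv.reader.
    """
    lines = text.splitlines()
    header = [line for line in lines if line.startswith("#")]
    body = [line for line in lines if line and not line.startswith("#")]
    if not body:
        return header, []
    delimiter = "\t" if "\t" in body[0] else ","
    rows = list(csv.reader(body, delimiter=delimiter))
    return header, rows
```

Note that in files where the column title row itself starts with a hash, that row ends up as the last header line; in files without hash lines it is simply the first data row.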
Both file forms in the spreadsheet world tend to be given a .csv file extension or suffix. But in this industry, the TSV file is often given a .txt file name extension. VCF happens to also use a simple, free-form TSV design. Often the microarray format .csv/.txt files are compressed to save space, as they contain 600 to 700 thousand diploid SNP values and can get quite large. Standard ZIP format is most often used to compress them, so the files are commonly delivered with a .zip container suffix.
These microarray file formats have one row (text line) for each Probe result, most often an SNP. They identify the Probes by rsID as well as by the chromosome (sequence) name and position within it. Some files that contain only yDNA or mtDNA use SNP names in place of the rsID. The key point is that at least one of three forms is needed to identify the SNP row: (1) chromosome name and position, (2) rsID, or (3) SNP name. Usually more than one identification form is present; most microarray file formats use both the rsID and the chromosome name and position to identify each SNP. Sometimes the two forms conflict with each other, leading to more confusion. The chromosome coordinate position is most often defined for the forward / 5'-3' / positive direction. Occasionally, and without indication, a file uses the reverse / 3'-5' / negative direction (and then often a complemented value), leading to further confusion.
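When a file turns out to report values for the opposite strand, the genotype has to be complemented before it can be compared with other kits. A minimal Python sketch of that step follows; the name `complement_genotype` is hypothetical, and the choice to pass non-base symbols through unchanged is an assumption.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement_genotype(genotype):
    """Return the genotype as it would read on the opposite strand.

    Bases are swapped A<->T and C<->G. Any other symbol (no-call dashes,
    zeros, I/D InDel codes) is passed through unchanged.
    """
    return "".join(COMPLEMENT.get(base, base) for base in genotype)
```

A/T and C/G SNPs are the awkward case: an "AT" call complements to "TA", which is the same unordered pair, so the strand cannot be inferred from such values alone.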
The microarray file formats are a simple form of a RAW, annotated VCF. Basic, unannotated VCFs do not include the rsID. Normal VCFs only contain the derived variants, whereas microarray file formats contain all tested values, whether they are derived or ancestral. A RAW VCF, before filtering, may include ancestral value SNPs. This is likely why you often see the term "RAW Data File" associated with these microarray file formats.
There are a number of tools that read and process microarray format files. See the Third Party Analysis Tools page for more details. In particular, the DNA Kit Studio allows the manipulation and even merger of various microarray test result files. Felix Immanuel was the first to provide such tools early on. Even bcftools (from the samtools / HTSlib project) can read the TSV files and generate basic VCF ones, owing to the popularity of this free-form mechanism with early, simple DNA testing results.
Features and Variances
Let us first cover some of the basic features and variances of the files between vendors, and then delve into examples of the actual file formats.
Feature Comparison Quick Summary
The table below summarizes the major features / content from each company. This is taken from our original chart at the bottom of the Genetic Genealogy Testing page that we introduced back in 2014.
Feature | 23-v3 | 23-v4 | 23-v5 | Anc-v1 | Anc-v2 | FTDNA | NGG Geno2.0 | NGG Geno2.0+ NextGen | MyHer | LivDNA |
Approx Size (MB, compressed) | 8 | 5 | 6 | 6 | 6 | 6.5 | 1 | 6.5 | 6.4 | 6 |
Build Type | 37 | 37 | 37 | 37 | 37 | 36/37 | 37 | 37 | 37 | 38 |
DNA Type(s) | Auto,X, Y,MT | Auto,X, Y,MT | Auto,X, Y,MT | Auto,X, Y | Auto,X, Y | Auto, X | Auto,X, Y,MT | Auto,X, Y,MT | Auto, X | Auto,X, Y,MT |
SNP ID | RS#, i | RS#, i | RS#, i | RS #, i | RS #, i | RS #, i | RS #, SNP2 | RS #, SNP2 | RS #, i | RS #,i SNP2 |
Auto Probes 1-22 | 930,281 | 577,382 | 614,007 | 682,549 | 650,647 | 690,715 | 126,306 | 698,192 | 702,442 | 603,129 |
X/23 | 26,007 | 19,487 | 16,530 | ? | 25,250 | 17,478 | 3,803 | 17,812 | 17,892 | 15,511 |
Y/24 | 1,766 | 2,329 | 3,734 | 885 | 1,668 | 10,578 3 | 11,978 | 13,533 | 482 | 3822 |
MT/261 | 2,459 | 3,154 | 4,273 | - | 262 | - | 442 | 412 | - | 212 |
2 LivingDNA and NGG only report positive-for-change (derived, positive, changed) SNPs for Y and Mt. So it is not clear how many they are actually testing nor how many are ancestral (negative, un-changed). They also only report these values as a list of SNP names and not by rsID or position. Each is in a separate file.
3 FamilyTreeDNA started reprocessing their v3 kits in late 2023 to now supply the yDNA SNP values that were stripped out previously. They are providing them as a separate file in the same format but as Build38. Many have the actual yDNA SNP names. Some are just rsID nomenclature. As of this time, they no longer give the option of downloading the autosomes and xDNA separately or as different Build36 or Build37. Now only combined and in Build37.
File Sizes and Versions
It is not enough to know the vendor of your test. You need to know which version of the chip microarray (CMA) they used in the lab on your sample and even, as it turns out, which minor version of the file format they have provided your data in. Note that some of these minor versions were coding errors that were later fixed. You can sometimes get an updated, corrected file simply by re-downloading a new RAW file. If you have a file that does not fit into the metrics of the chart below, please let us know so we can catalog another minor version.
A quick and dirty way to figure out your particular test company and file version is to count the number of lines in your file. Each data row (line) contains the result for one Probe or marker from the test. The number of header lines is always under two dozen and so does not really affect the rounded-to-thousands count. This count method is more reliable if you know the test company source as well, as some of the test company files for a particular version are very similar in size.
On any Unix or BASH shell, one can simply execute the command
zcat <microarray>.zip | wc -l
On Windows PowerShell with 7-Zip installed, the equivalent is
7z.exe e -so <microarray>.zip | Measure-Object -Line
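Where neither shell tool is at hand, the same count can be made with Python's standard `zipfile` module. This is a minimal sketch that assumes the archive holds the RAW file as its first (usually only) member; the function name `count_lines_in_zip` is hypothetical.

```python
import zipfile

def count_lines_in_zip(zip_path):
    """Count the text lines in the first member of a ZIP archive.

    Assumption: the RAW data file is the first member of the archive.
    """
    with zipfile.ZipFile(zip_path) as archive:
        first_member = archive.namelist()[0]
        with archive.open(first_member) as handle:
            # Iterating a binary file object yields one line at a time.
            return sum(1 for _ in handle)
```

Dividing the result by 1,000 and rounding gives the "K lines" figure used in the chart below.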
Vendor | Version | Start Date | End Date | File Size (K lines) | ISOGG Table (K SNPs) | WGS Extract (K SNPs) | HGR Model0 | Microarray Chip Used | ||
23andMe | API | - | Sep 2018 | - | 1,498 | Supported API interface SNP list (now researcher access only) | ||||
23andMe | v2 7 | late 2007 | - | 577, 580 | 571 | - | NCBI36 | Illumina Hap550+ (Human BeadChip) | ||
23andMe | v3 | Nov 2010 | Nov 2013 | 961 | 956 | 959 | Illumina Omniexpress (Human BeadChip) | |||
23andMe | v4 a-c | Nov 2013 | Aug 2017 | 602, 611 (599) 6 | 605 | 602 | Illumina Infinium HTS iSelect HD | |||
23andMe | v5 | Aug 2017 | - | 639 | 630 | 638 | Illumina GSA | |||
Ancestry | v1 | Jan 2012 | May 2016 | 701 4 | 700 | 701 | Illumina Omniexpress (Genotyping BeadChip) | |||
Ancestry | v2 a-b | May 2016 | May 2018 | 669 / 650 5 | - | 669 | Illumina Omniexpress+ (Genotyping BeadChip) | |||
Ancestry | v2 c-d | May 2018 | - | 664 / 678 5 | 662? | - | Illumina Omniexpress+ (Genotyping BeadChip) | |||
FTDNA | v1 | - | Feb 2011 | 564 (550) | - | 548 | HG16 / NCBI34 | Affymetrix Axiom xxx 1 (No Y, MT) | ||
FTDNA | v2 a-d | Feb 2011 | Apr 2019 | 725 (708 / 716), 720 9 | 725 (v1) 8 | 720 | Illumina OmniExpress (Microarray Chip) (No Y, MT) | |||
FTDNA | v2y | Jun 2024 | - | 15 | Illumina OmniExpress (reprocessed gv1.1 to pull out Y in separate Build 38 file) | |||||
FTDNA | v3 | Apr 2019 | - | 630 (v2) 8, 13 | 614 | Illumina GSA (no Y, MT) | ||||
FTDNA | v3y | Jan 2024 | - | 10 | Illumina GSA (reprocessed gv1.3 to pull out Y in separate Build 38 file; newly delivered others have XY and MT) | |||||
LivingDNA | v1 | Sep 2016 | Oct 2018 | 619 | 619 | 619 | Illumina GSA | |||
LivingDNA | v2 | Oct 2018 | - | 692 (660 Fem)11 | 699 | 699 | Affymetrix12 Axiom Sirius | |||
MyHeritage | v1 | Nov 2016 | Mar 2019 | 721 | 720 | 721 | Illumina OmniExpress (Microarray Chip) | |||
MyHeritage | v2 | Mar 2019 | - | 607 | 610 | Illumina GSA | ||||
TellMeGen | v? | ? | - | 780 (609 / 678) | - | - | Illumina GSA | |||
MTHFR Genetics | v? | - | 640 | - | UK (no male Y sample) | |||||
Genera | v? | - | 640 | - | BR | |||||
meuDNA | v1 | - | Dec 2021 | 632 | - | BR | ||||
meuDNA | v2 | Jan 2022 | - | 654 | - | BR | ||||
SelfDecode | v? | ? | ? | 687 | - | GRCh38 | USA | |||
Reich Lab | HOv1 | ??? 2015 | - | 598 | - | Affymetrix12 Human Origins v1 | ||||
Reich Lab | 1240K | - | 1,233 | - | 1240K panel (Allen Ancient DNA Resource - AADR) | |||||
NGG Geno | v2 2 | Oct 2012 | Nov 2015 | 142 | - | NCBI36 | ||||
NGG Geno | v2+ 2 | Nov 2015 | May 2019 3 | 730 | - | NCBI36 | Illumina custom GenoChip (Y and MT in separate, SNP-list file. Y has +-. MT only variant +) | |||
WGS Extract CombinedKit | v2 | Nov 2019 | Jun 2020 | 2,080 | - | 2,080 | HG19 / GRCh37 | WGS Extract's "CombinedKit"10 (Superkit on Steroids) option from WGS Results |
note1: FTDNA retested all FamilyFinder v1 samples using the new v2 Illumina chip and replaced the output files
note2: National Geographic Genographic files are separated by chromosome type and use SNP names and not rsIDs to identify the yDNA and mtDNA entries. v2 is mainly a haplogroup test. v2+ is better known as NextGen.
note3: After Nov 2016, this is only for non-North-America orders (non-Helix, still FTDNA) till the shutdown of testing in Nov 2019.
note4: During Fall 2015 (Sep-Nov), Ancestry put out their RAW files with a truncated header not giving version numbers and other information
note5: 669 is the norm that was started with. Winter 2018 (650) and Summer 2018 (664) saw smaller sizes that were often "fixed" on request; Feb 2019 began to see larger files (678). v2a and v2b differ only by minor variations, as do v2c and v2d. But v2b to v2c saw over 150K entries dropped and another ~150K different entries added. (SNPedia picked up on this and calls them variations 2c and 2d, although we see a 4th, which we call 2b, that they do not mention.)
note6: All of 2015 and beyond saw file sizes of 611K lines for the 23andMe v4 test with a few 599K ones scattered that year (both sexes). 602K was for 2013 and first half of 2014. The variance between kit versions is on the order of 20K entries or less.
note7: 23andMe v1 used the same chip as v2. But we have found no data about its output size and characteristics. So have left it out of the table.
note8: ISOGG chose to ignore / skip the original Affymetrix FamilyFinder test and starting numbering from 1. Most others do not follow this convention.
note9: 720K is the final standard (v2d). Pre-2015 are all reprocessed to 725K entries (v2a) (and may be really v1 kits originally?). Some 716K entry sizes (v2c) are seen in 2016 (both sexes). The earliest and single occurrence we saw of 708K entries (v2b) in 2015 is still being investigated. Note these are based on build37 model and the Auto+X download. v2a seems to be an almost exact superset of v2b-d.
note10: There are possibly as many as 10k InDels in these files that are not currently properly handled and called correctly. Genetic Genealogy sites ignore InDels so this is generally not a problem.
note11: LivingDNA supplies the yDNA and mtDNA in separate positive-for-change SNP name lists only. Main files are atxDNA only like for FamilyTreeDNA.
note12: Thermo-Fisher Scientific acquired Affymetrix and their Axiom microarray product line in 2016
note13: As of Fall 2023, FamilyTreeDNA have reprocessed their results and now supply a separate file for the yDNA SNP values, with approximately 10,579 additional entries. They also stopped providing an option to download autosomes and xDNA separately and just have a single file for them. There is no option to download in Build36; only Build37 is being provided.
Table Sources: Reference below and Randy's 80+kits covering most versions and companies.
Note that TellMeGen, SelfDecode, MTHFR Genetics, Genera and meuDNA are not genetic genealogy focused companies, but they are expanding into that area as they grow the market for their consumer DNA test products. Their result files can be used in Third Party Analysis Tools just like the others, just as all the traditional genetic genealogy focused result files can be used on other sites that provide health, wellness and trait analysis.
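The line-count method can be turned into a small lookup against the chart above. The Python sketch below copies a few of the "File Size (K lines)" values from that chart; the names `KNOWN_K_LINES` and `candidate_versions` are hypothetical, and the default tolerance of 1K lines is an assumption rather than a measured figure.

```python
# "File Size (K lines)" values copied from the chart above (subset only).
KNOWN_K_LINES = {
    "23andMe": {"v2": [577, 580], "v3": [961], "v4": [599, 602, 611], "v5": [639]},
    "Ancestry": {"v1": [701], "v2a-b": [650, 669], "v2c-d": [664, 678]},
    "MyHeritage": {"v1": [721], "v2": [607]},
    "LivingDNA": {"v1": [619], "v2": [660, 692]},
}

def candidate_versions(vendor, line_count, tolerance_k=1):
    """List the versions whose known size is within tolerance_k thousand lines.

    tolerance_k is an assumed slack; unknown vendors give an empty list.
    """
    k_lines = round(line_count / 1000)
    return [
        version
        for version, sizes in KNOWN_K_LINES.get(vendor, {}).items()
        if any(abs(k_lines - size) <= tolerance_k for size in sizes)
    ]
```

An empty result for a known vendor is a hint that the file may be a minor version not yet in the chart.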
Minor variations in major versions
The chart here provides a few more details on the variations of microarray file formats within a major version. Only Ancestry made a very significant change.
ISOGG has since created comparison tables of the various test kits. Their covered SNP counts often vary considerably from our measured values. We have not yet determined the reason. The companies vary the outputs within a version and time period, as is shown in the table above, but this does not seem to account for ISOGG's generally lower counts.
Not incorporated in the above is an article detailing some variations in the 23andMe files for mitochondria over time. This is mostly found in files downloaded before 2012. If the file is re-downloaded, it is often corrected. Similar documented and undocumented changes occurred in 23andMe, Ancestry and FTDNA file content within major versions over time.
UCSC Templates
We have discovered "templates" for many of the microarray chips on the UCSC server. It is not clear why they are there or what they are used for. They do not appear to have the vendor-introduced variations. (Illumina and others let larger customers customize around 50K entries on a microarray chip. This is how NGG was able to have around 13K Y SNPs defined.) Here is the template listing found when we visited the site in 2020.
Affy5 | Affy6, Affy6SV | Affy250Nsp, Affy250Sty | ||||||||
Illumina1M | Illumina1MRaw | IlluminaGDA | ||||||||
Illumina300 | IlluminaHuman660W_Quad | IlluminaHuman660W_QuadRaw | ||||||||
Illumina550 | IlluminaHumanCytoSNP_12 | IlluminaHumanCytoSNP_12Raw | ||||||||
Illumina650 | IlluminaHumanOmni1_Quad | IlluminaHumanOmni1_QuadRaw |
Study of Available Arrays
Long after we compiled the information for this page, a study came out on the utility of the various genotyping arrays. Part of the study includes the data and analysis of the various arrays, some of which we capture in the list below. (Sizes are in thousands of entries.) It shows much more diversity than we expected, and larger counts than expected, as we thought most arrays were 1,000 x 1,000 at most (limiting the result to around 1 million entries).
Array | Size | Array | Size | Array | Size | Array | Size | Array | Size | |
Affymetrix12 6.0 | 932 | Axiom AveraNTR | 671 | Axiom GW ASI | 630 | Axiom GW CHB2 | 658 | Axiom GW EUR | 675 | |
Axiom GW LAT | 818 | Axiom GW PanAFR | 2,268 | Axiom PRNA | 920 | Axiom UKB WCSG | 842 | CytoSNP 850k_b | 850 | |
Drug Dev Consortium 15073507 A1 | 475 | GSA 24v3 A1 | 653 | GSA MD 24v1-0 20011747 A4 | 693 | Human 660W quad v1 | 591 | Human Core 12v1-0 a | 298 | |
Human CytoSNP 12v2-1 H | 295 | Human Omni 2.5-4v1 h | 2,434 | Human Omni 5-4v1 c | 4,269 | Human OmniExpress 12v1-1 b | 718 | Human OmniZhongHua 8v1-0 c | 899 | |
Infinium Exome 24v1-1 A1 | 245 | Infinium Immuno Array 24v2-0 a | 252 | Multi-Ethnic AMR AFR 8v1-0 A1 | 1,425 | Multi-Ethnic EUR EAS SAS 8v1-0 A1 | 1,474 | Multi-Ethnic Global A1 | 1,761 | |
Onco Array 500K B | 498 | PMDA hg19 | 918 | Psych Array B | 570 |
External Links
- Xcode.life
- SNPedia Company Testing Overlap with SNPedia / ClinVar, and individual Ancestry-FTDNA-23andMe entries
- Louis Kessler's summary of file formats he studied. Unfortunately, does not identify the different versions of kits and thus the different chips used. This may explain some of his discrepancies.
- Rebekkah Canada's Exploring Microarray Chips compendium articles on her haplogroup.org site (only available from archive.org now)
- Enlis Genomics blog posts
- See also the more recent works that were found (sometimes years) after writing the above:
- ISOGG Autosomal Comparison Chart — less accurate than most sources for unknown reasons. But a top level summary of the overlap seen by them between different kits. Based on Louis Kessler's work mentioned above
- NGG FAQ (archive copy)
- Tim Jansen's Rootsweb email of 30 Jan 2013 (archive copy)
- MAGE-TAB (doc) Specification from the Functional Genomics Data Society (FGED) (now defunct), which is the closest relative to a RAW file format specification we can find. It really covers more complex results and may have come mostly after the vendors settled on their own formats. The work now really lives in the Genomics Standards Consortium (GSC). The GSC can be difficult to comprehend from their website, but the product is more easily discovered at Fair-Sharing (GSC page). Many of the documents are more descriptions of services and sites than interchange formats and standards.
Actual File Formats
So let's get on to describing the actual file formats themselves. A reminder that all files share a few common features: for example, being a TSV or CSV format file and having a header of one or more lines that often, but not always, start with a hash ("#"). Most are Build37 delivered results, sorted in an expected order of chromosomes 1-22, X, Y and MT. But variations exist and are indicated below.
We start with a summary table and then introduce each of the formats.
Summary table of formats
Vendor | File Ext | File Form | File Line End | Chr Labels & Entry Order | Allele Form | Allele Values | Ref Build | IDs | Header | Notes |
23andMe | .txt | TSV | \r\n | 1-22, X, Y and MT | AG | ACGT, DI, — | 37 | rsID, iNNN | ~20 # lines including last column title row | Single value in X, Y and MT (for males); dash always homozygous. Female Y is all double dash. |
Ancestry | .txt | TSV | \r\n | 1-22, 23 (X), 24(Y), 25 (PAR), 26 (M) | T C | ACGT, ID, zero ; any order | 37 | rsID | ~18 # lines followed by column title row starting with "rsid" | Always double values; zero always homozygous. Female Y is zeros but PAR is heterozygous in both. DI only in v2c and beyond |
FTDNA | .csv | true CSV | \r\n | 0-22, (XY. MT) X (if selected) | AT | ACGT,( DI,) — | 37 (36 v1)1 | rsID, VG, (seq-rsID. kgp, 2010-, GSA, LDLR, IDS, DY, CF, DrGene, FAM, HPS, PEX, 1SNP, indel, ...) | Single row column-definition starting RSID (only unquoted value row except v3 is quoted) | v3 has wide variety of IDs; v2 only first two. Early v2 ONLY generated separate 1-22 and X files or concatenated them so header appears in middle again; InDel only in v2b and v3; chr/pos 0 in v2a and v3; only v3 has XY (?). As of late 2023, they now have a separate yDNA only file with most entries having actual SNP names; otherwise rsIDs |
LivingDNA | .txt | TSV | \n | 1-22, X | AT | ACGT, — ; any order | 37 | rsID, AX, AFFX (, 1:, exm2, JHU, var, kgp, 1kg, SNP. gw) | ~11 # lines of header including last column title row | Y and MT in separate files listing only derived SNPs; v1 has the large variance in names; v2 has >2 allele values. Often two sets of similar sequences (two inserts?) but not always (insert and delete?); longest is 21x2 |
MyHeritage | .csv | true CSV | \n | 1-22, X, Y | AT | ACGT, DI, — | 37 | rsID (,VG) | ~7-12 # lines followed by column title row starting RSID (only unquoted value row) | only v1 has VG ID's; only v2 has ID alleles and only on X ; early v2 had no quotes EXCEPT on chromosome 17 where they quoted coordinates and inserted commas as thousand separator |
TellMeGen | .csv | TSV | \r\n | 1, 10, 11 ... 22, 3, ... 9, MT, X, XY, Y | TA | ACGT, ID, — ; Any order | 37 | rsID, chr1, dupseq, ilmnseq_rs, GSA_rs, seq-rs, TOP, ... | Single row column-definition starting "# rsid" | Very large assortment of names including just a single dot |
MTHFR Gen | .txt | TSV | \r\n | 1-22, MT, X(, Y?) | TA | ACGT, ID, — ; Any order | 37 | rsID | Single row column-definition starting with "rsid" | No male sample obtained yet; One RSnnn (cap) |
meuDNA | .csv | CSV unquoted | \n | 0-22, X, Y, (XY, )MT | AT | ACGT, DI, — | 37 | rsID, 2010-, GSA, ... (similar to LivingDNA v1 but no AFF(x)) | Single row column-definition starting with RSID | 782 0,0 entries ; no quotes ; no XY in v2; diff mix of IDs between v1 and v2 |
Genera | .csv | CSV unquoted | \n | 0-22, X, Y, MT | AT | ACGT, — | 37 | rsID, GSA, ilmseq, MTR, 2006, ... | single row column-definition starting with RSID | Y and MT is single value ; template only so cannot tell if InDels |
Self Decode | .txt | TSV | \n | 1-22, X, Y, MT | TA | ACGT | 38 | rsID, GSA, ilmseq, exm, seq, 1:, JHU, MFN, variant, indel, BOT, chr1:, newrs, ... | 8 lines including single row of column definitions | X, Y and MT single value (in males) |
Reich 1240K | .txt | TSV | \r\n | 1-22, X, Y, MT | TA | ACGT, — | 37 | rsID, snp_, Affx_, 1kg, Y SNP names | two lines including single row of column definitions | No format defined. So utilize 23andMe one. |
Reich HumOrig | .txt | TSV | \r\n | 1-22, 23(X), 24(Y) | TA | ACGT ; Any order | 37 | rsID, snp_, Affx_ | two lines including single row of column definitions | No format defined. So utilize 23andMe one with minor exceptions. |
NGGeno | .csv | TSV | 36 | Handled by FTDNA till near the end. Near identical files and formats. |
1 As of Fall 2023, FamilyTreeDNA has reprocessed old results and added a yDNA SNP file with approximately 10,578 entries (v3) or 15,409 entries (v2). The yDNA file is in Build38 and uses SNP names instead of rsIDs when available, so it is not Build37 like the combined autosome and xDNA file now provided. Even before the Y file became available, they changed from offering a download of autosomes only, X only, or combined, in either Build36 or Build37, to offering only a combined file that includes the XY and MT entries after chromosome 22 and before X. D and I entries only appear in the Y chromosome files.
Sometimes the PAR region is split out from either X or Y. The PAR1 region is at the same position in X and Y for build38; the X is shifted by 50k in build37. The PAR2 region starts at ~154.9 million on X in build37 and ~155.7 million in build38, per the table below. Any alleles defined in a PAR region of X or Y cannot be reliably distinguished as to the source. The Pseudo-Autosomal Regions for the two builds are:
Region | Build | Chr | Start | Stop | Length | |||||
PAR1 | 37 | X | 60,001 | 2,699,520 | 2,639,519 | |||||
PAR1 | 37 | Y | 10,001 | 2,649,520 | 2,639,519 | |||||
PAR1 | 38 | X or Y | 10,001 | 2,781,479 | 2,771,478 | |||||
PAR2 | 37 | X | 154,931,044 | 155,260,560 | 329,516 | |||||
PAR2 | 37 | Y | 59,034,050 | 59,363,566 | 329,516 | |||||
PAR2 | 38 | X | 155,701,383 | 156,030,895 | 329,512 | |||||
PAR2 | 38 | Y | 56,887,903 | 57,217,415 | 329,512 |
- Source: ENSEMBL PAR regions in Build 38,
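The boundaries in the table lend themselves to a simple lookup for flagging X or Y entries that fall inside a PAR. A minimal Python sketch follows, using exactly the start and stop values listed above; the names `PAR_REGIONS` and `in_par` are hypothetical, and treating both boundaries as inclusive is an assumption.

```python
# (build, chromosome) -> list of (start, stop) taken from the table above.
PAR_REGIONS = {
    (37, "X"): [(60_001, 2_699_520), (154_931_044, 155_260_560)],
    (37, "Y"): [(10_001, 2_649_520), (59_034_050, 59_363_566)],
    (38, "X"): [(10_001, 2_781_479), (155_701_383, 156_030_895)],
    (38, "Y"): [(10_001, 2_781_479), (56_887_903, 57_217_415)],
}

def in_par(build, chromosome, position):
    """Return True if the position lies in PAR1 or PAR2 (bounds inclusive)."""
    regions = PAR_REGIONS.get((build, chromosome), [])
    return any(start <= position <= stop for start, stop in regions)
```

Any other chromosome or build simply returns False, so the check can be applied to every row of a file without special casing.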
23andMe
File formats from all versions are the same. But the SNPs reported change between versions of the Microarray Testing chips used. To date, all versions use Illumina products.
- 20 lines of header
- Tab separated (TSV) pseudo RAW-VCF file
- Column definition included
- Chromosomes labeled 1-22, X, Y and MT
- Genotype values: A, G, C, T, I, D, - (I and D are for Insert and Delete. InDels are not really SNPs but reported as such here)
- Both genotype values together (unordered pair); always increasing alphabetic order (AG but not GA)
- No calls: --
- Single value in X, Y and MT but still double dash for no call (double value for X in females; Y in females is all no call)
# This data file generated by 23andMe at: Thu Dec 17 14:11:20 2015
#
# This file contains raw genotype data, including data that is not used in 23andMe reports.
# This data has undergone a general quality review however only a subset of markers have been
# individually validated for accuracy. As such, this data is suitable only for research,
# educational, and informational use and not for medical or other use.
#
# Below is a text version of your data. Fields are TAB-separated
# Each line corresponds to a single SNP. For each SNP, we provide its identifier
# (an rsid or an internal id), its location on the reference human genome, and the
# genotype call oriented with respect to the plus strand on the human reference sequence.
# We are using reference human assembly build 37 (also known as Annotation Release 104).
# Note that it is possible that data downloaded at different times may be different due to ongoing
# improvements in our ability to call genotypes. More information about these changes can be found at:
# https://www.23andme.com/you/download/revisions/
#
# More information on reference human assembly build 37 (aka Annotation Release 104):
# http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?taxid=9606
#
# rsid	chromosome	position	genotype
rs12564807	1	734462	AA
i3001395	MT	15530	--
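Reading this layout takes only a few lines of Python. The sketch below assumes every data row has exactly four tab-separated fields, as in the sample above; the function name `parse_23andme` is hypothetical, and a malformed row will raise an error rather than be skipped.

```python
def parse_23andme(lines):
    """Yield (rsid, chromosome, position, genotype) from 23andMe-style lines.

    Lines starting with '#' (including the column title row) are skipped.
    Assumes exactly four tab-separated fields per data row.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        rsid, chromosome, position, genotype = line.split("\t")
        yield rsid, chromosome, int(position), genotype
```

The genotype is left as delivered, so "--" still marks a no-call and a single letter still marks a haploid X, Y or MT value.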
From their Downloads page FAQ; a change log to the format:
- July 27, 2017: As part of our continuous efforts to improve the quality of data present in your raw data download, the number of SNPs available in your download may have changed.
- July 22, 2015: We updated call filtering in the downloaded file so it matches filtering in the Raw Data tool. Some customers may see "--" (a "no call") as their genotype for some SNPs on the X chromosome, Y chromosome, or in their MT DNA, where their downloaded data file previously showed a "D" call.
- July 28, 2014: Analysis of our data has allowed us to improve the interpretation of over 10,000 SNPs genome-wide on the V4 chip. In the next couple of days, V4 customers will see calls for SNPs that previously did not appear in their raw data.
- August 9, 2012: We updated our database to report SNP positions using the NCBI Build37 (also known as Annotation Release 104) genome assembly. Users will see changes in their raw data positions.
- September 29, 2011: Analysis of our data has allowed us to improve the interpretation of several SNPs. In the next week, customers may see changes in their raw data.
- January 13, 2011: We updated our database to incorporate data from a more recent build of dbSNP. Some rsIDs have changed location and/or flanking sequence in dbSNP such that our probes are no longer meaningful to assay them. The names of these rsIDs have been changed in the raw data to internal ids starting with "i499...". We have also improved the interpretation of a number of SNPs and removed others that had poor data quality. In the next couple of days, customers may see changes in calls for those SNPs.
- March 25, 2010: Analysis of our data has allowed us to improve the interpretation of several dozen SNPs. A portion of the SNPs are on the mitochondrial chromosome. In the next couple of days, customers may see changes in calls for those SNPs.
- October 8, 2009: Analysis of our data has allowed us to improve the interpretation of over 1500 SNPs. A portion of the SNPs are on the mitochondrial chromosome. In the next couple of days, customers may see changes in calls for those SNPs.
- June 4, 2009: Analysis of our data has allowed us to improve the interpretation of over 500 SNPs. Most of these SNPs are on the Y chromosome. In the next couple of days, customers will see calls for SNPs that previously had a no-call or appeared not genotyped.
- April 9, 2009: Analysis of our data has allowed us to improve the interpretation of 10 SNPs: rs4420638, rs34276300, rs3091244, rs34601266, rs2033003, rs7900194, rs9332239, rs28371685, rs1229984, and rs28399504. In the next couple of days, some customers will see calls for SNPs that previously had a no-call or appeared not genotyped.
AncestryDNA
- 16 lines of header intro, 17th line is column headers.
- Tab separated. NoCalls appear as '0' (zero) and always appear in pairs.
- Alleles in separate columns (but still unordered); can be any alphabetic order (A T and T A)
- Chromosomes labeled 1-22, 23 for X, 24 for Y, 25 for X/Y PAR region values (not sure if position from X or Y), and 26 for M (later kits only)
#This file was generated by AncestryDNA at: 06/27/2015 09:23:22 MDT
#Data was collected using AncestryDNA array version: V1.0
#Data is formatted using AncestryDNA converter version: V1.0
#Below is a text version of your DNA file from Ancestry.com DNA, LLC. THIS
#INFORMATION IS FOR YOUR PERSONAL USE AND IS INTENDED FOR GENEALOGICAL RESEARCH
#ONLY. IT IS NOT INTENDED FOR MEDICAL OR HEALTH PURPOSES. THE EXPORTED DATA IS
#SUBJECT TO THE AncestryDNA TERMS AND CONDITIONS, BUT PLEASE BE AWARE THAT THE
#DOWNLOADED DATA WILL NO LONGER BE PROTECTED BY OUR SECURITY MEASURES.
#
#Genetic data is provided below as five TAB delimited columns. Each line
#corresponds to a SNP. Column one provides the SNP identifier (rsID where
#possible). Columns two and three contain the chromosome and basepair position
#of the SNP using human reference build 37.1 coordinates. Columns four and five
#contain the two alleles observed at this SNP (genotype). The genotype is reported
#on the forward (+) strand with respect to the human reference.
rsid	chromosome	position	allele1	allele2
rs4477212	1	82154	T	T
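To compare an AncestryDNA file with the others, the numeric chromosome codes and the split allele columns need normalizing. The Python sketch below is one way to do it; the function name `parse_ancestry`, the output labels "PAR" and "MT", and the decision to sort the alleles and turn zeros into dashes (23andMe style) are all assumptions made for illustration.

```python
# Numeric chromosome codes described in the list above; the right-hand
# labels are this sketch's own choice.
ANCESTRY_CHROMOSOMES = {"23": "X", "24": "Y", "25": "PAR", "26": "MT"}

def parse_ancestry(lines):
    """Yield (rsid, chromosome, position, genotype) from AncestryDNA-style lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        # Skip blanks, '#' header lines and the unhashed column title row.
        if not line or line.startswith("#") or line.startswith("rsid\t"):
            continue
        rsid, chromosome, position, allele1, allele2 = line.split("\t")
        chromosome = ANCESTRY_CHROMOSOMES.get(chromosome, chromosome)
        # Join the two columns, map zero no-calls to dashes, sort alphabetically.
        genotype = "".join(sorted((allele1 + allele2).replace("0", "-")))
        yield rsid, chromosome, int(position), genotype
```

Since Ancestry always reports two values, haploid X, Y and MT calls in males come out doubled (for example "TT"), unlike the single letter used by 23andMe.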
FamilyTreeDNA
FTDNA started out using an Affymetrix chip for its Microarray Testing but moved to an Illumina one very quickly after introduction.
- CSV with commas and each field surrounded by double quotes (true CSV)
- Single column-header definition header; no other header information
- Build37 or Build36 (selected at download time; cannot tell which by header content)
- Separate file for Auto and X (or now combined if desired)
- Chromosomes numbered 1-22; X if X file or combined file
- Both genotype values together (un-ordered pair)
- No calls: "--"
- Has entries with chromosome 0 and coordinate 0; usually no call as well.
- Later versions added XY and MT after chromosome 22 and before chromosome X, and removed the quotes around fields at about the same time. A rare error (one entry per file) left a row with its leading double quote missing (bad format).
- 2024 saw the introduction of a separate Y file based on Build38, using SNP names instead of rsIDs (when available), with 10K (v3) to 15K (v2) entries. It also has DD and II entries (all entries are reported as diploid, with both values identical).
RSID,CHROMOSOME,POSITION,RESULT
"rs4477212","1","72017","AA"
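The FTDNA quirks listed above (quoted or unquoted fields, "--" no-calls, chromosome 0 entries, and the occasional missing leading quote) can be handled in one pass. The following is a minimal sketch using Python's standard `csv` module; the function name `parse_ftdna` and the choice to drop chromosome 0 rows are assumptions for illustration. Note that nothing in the file reveals whether it is Build36 or Build37, so the sketch does not try to guess.

```python
import csv


def parse_ftdna(lines):
    """Yield (snp_id, chromosome, position, genotype) from FTDNA-style CSV lines."""
    for row in csv.reader(lines):
        if len(row) < 4:
            continue  # blank or malformed row
        # Stripping stray quotes also repairs the rare row whose leading
        # double quote is missing, where the first field keeps a trailing quote.
        snp_id, chrom, position, result = (
            field.strip().strip('"') for field in row[:4]
        )
        if snp_id.upper() == "RSID":
            continue  # the single column-header row
        if chrom == "0":
            continue  # chromosome 0 / coordinate 0 entries; dropped in this sketch
        genotype = None if result == "--" else result
        yield snp_id, chrom, int(position), genotype
```

Because `csv.reader` accepts both quoted and unquoted fields, the same function reads the earlier quoted files and the later unquoted ones.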
LivingDNA
For Autosomal & X: 10 rows of header, then a single row for column headers. Tab-separated (TSV) columns in pseudo RAW VCF style. rsID identifiers.
- TSV with dual column alleles
- Build37
- Separate file for Y and MT with derived, named SNPs only
- Chromosomes labeled 1-22, X
- Both alleles together; unordered pair
- SNP identifiers are rsID, AX, or AFFX names (the sample below also shows a chromosome:position form)
# Living DNA customer genotype data download file version: 1.0.1
# File creation date 11-29-2017
# The content of this file is subject to updates and changes depending on the time of download.
# This genotype data should be treated as personal information.
# This genotype data is not suitable for clinical/medical research or diagnosis.
# The user assumes all responsibility for the security of this file.
# Please refer to the Living DNA Terms and Conditions on our website (www.livingdna.com) for more information.
# Human Genome Reference Build 37 (GRCh37.p13).
# Genotypes are presented on the forward strand.
#
# rsid chromosome position genotype
rs9283150 1 565508 AA
1:726912 1 726912 AA
rs116587930 1 727841 GG
For Y: a simple list of only the derived (positive, changed) SNP names, so it is not clear how many SNPs were tested or which are ancestral (negative, unchanged). The sample file has 382 entries. Each row is one SNP; variant names for the same SNP appear on the same row, separated by slashes (/).
Sample (Y):
AM00847/AMM008/B65
AM01921.2/S475.2/Z2983.2
CTS10083
CTS10085/M1250/PF5948
For MT: a simple list of only the derived (positive, changed) SNP locations, so again it is not clear how many were tested or which are ancestral (negative, unchanged). The sample has 21 entries (similar to the changed-value list typical in 23andMe's test). The derived value is attached to the position number.
Sample (MT):
263G
462T
482C
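The two LivingDNA list files described above are simple enough to split with string operations. The sketch below is only illustrative: the function names `parse_y_names` and `parse_mt_variants` are hypothetical, and the MT pattern assumes every entry is a position followed directly by letters, as in the sample.

```python
import re

# A position followed by the derived value, such as "263G" (assumed pattern).
MT_ENTRY = re.compile(r"^(\d+)([A-Za-z]+)$")


def parse_y_names(lines):
    """Return one list of variant names per row; names are split on '/'."""
    return [line.strip().split("/") for line in lines if line.strip()]


def parse_mt_variants(lines):
    """Return (position, derived_value) tuples from entries such as '263G'."""
    variants = []
    for line in lines:
        match = MT_ENTRY.match(line.strip())
        if match:
            variants.append((int(match.group(1)), match.group(2)))
    return variants
```

Since only derived values are listed, neither function can say anything about SNPs that were tested and found ancestral.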
MyHeritage
6 lines of header, then a single line of column headings. Comma-separated list of entries enclosed in double quotes (") (note: early v2 is not quoted, but some tools will not accept that). Chromosomes labeled 1-22, X, Y (no MT). Double allele values. All rsID names (except v1 has some VG).
Sample:
# MyHeritage DNA raw data.
# This file was generated on 2018-06-18 14:06:02
# For each SNP, we provide the identifier, chromosome number, base pair position and genotype.The genotype is reported on the forward (+) strand with respect to the human reference build 37.
# THIS INFORMATION IS FOR YOUR PERSONAL USE AND IS INTENDED FOR GENEALOGICAL RESEARCH
# ONLY. IT IS NOT INTENDED FOR MEDICAL OR HEALTH PURPOSES. PLEASE BE AWARE THAT THE
# DOWNLOADED DATA WILL NO LONGER BE PROTECTED BY OUR SECURITY MEASURES.
RSID,CHROMOSOME,POSITION,RESULT
"rs4477212","1","82154","AA"
"rs3094315","1","752566","AG"
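The MyHeritage and FamilyTreeDNA samples share the same RSID,CHROMOSOME,POSITION,RESULT column header, so a tool has to look at the hash-prefixed lines to tell them apart. The sketch below shows one possible heuristic; the function name `guess_csv_vendor` and its return labels are assumptions, and the "FTDNA-like" label is deliberately loose because other files with no comment header can carry the same column row.

```python
COLUMN_HEADER = "RSID,CHROMOSOME,POSITION,RESULT"


def guess_csv_vendor(lines):
    """Return 'MyHeritage', 'FTDNA-like', or None from the first lines of a file."""
    comments = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            comments.append(line.upper())
            continue
        # First non-comment line: it must be the shared column header.
        if line.replace('"', "").upper() != COLUMN_HEADER:
            return None
        if any("MYHERITAGE" in comment for comment in comments):
            return "MyHeritage"
        # No comment header at all matches the FTDNA layout described earlier.
        return "FTDNA-like" if not comments else None
    return None
```

A real tool would combine this with further checks, such as entry counts or the chromosome labels present, before trusting the guess.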
TellMeGen
Near identical to the 23andMe format. Uses the Illumina GSA. No header except the one-line column header. Unique in that (1) it is the only one with Unix-style line endings (\n only; not the \r\n of DOS or the \r only of classic Mac OS), and (2) it delivers a TSV like 23andMe but with a .csv file extension. The line endings broke some tools.
Sample:
# rsid chromosome position genotype
rs12564807 1 734462 AA
i3001395 MT 15530 --
- Tab separated (TSV) pseudo RAW-VCF file with .csv file name extension (UNIQUE)
- Column definition included as only header row
- Chromosomes labeled 1-22, MT, X, and Y (in that order)
- Genotype values: A, G, C, T, I, D, - (I and D are for Insert and Delete. InDels are not really SNPs but reported as such here)
- Both genotype values together (unordered pair)
- No calls: --
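Because the TellMeGen file extension and line endings can mislead a tool, it is safer to inspect the content itself. The following is a minimal sketch of that idea; the function name `sniff_layout` is hypothetical, and counting tabs against commas on the first data line is just one simple heuristic.

```python
def sniff_layout(data: bytes):
    """Guess (line_ending, delimiter) from raw file bytes, ignoring the extension."""
    if b"\r\n" in data:
        ending = "\r\n"  # DOS / Windows
    elif b"\r" in data:
        ending = "\r"    # classic Mac OS
    else:
        ending = "\n"    # Unix, as in the TellMeGen files
    lines = data.decode("utf-8", errors="replace").splitlines()
    # Prefer the first data row; fall back to the first line of any kind.
    data_rows = [line for line in lines if line and not line.startswith("#")]
    probe = data_rows[0] if data_rows else (lines[0] if lines else "")
    # Ties (including a line with neither character) default to tab.
    delimiter = "\t" if probe.count("\t") >= probe.count(",") else ","
    return ending, delimiter
```

Reading the file in binary mode first, as assumed here, avoids any automatic newline translation hiding the original line endings.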
Self Decode
We only have a single sample to go by that was delivered in June 2023. That sample was delivered in Build38.
NGG Geno2.0
Comma-separated list. First row is the header titles. Identifiers are rsID or "kgp" (1000 Genomes Project); no positions. 130,110 entries in the Autosomal/X file. SNP names in the Y file, with 11,978 rows of values (in one example). The Y file has DD and II values. About 45 values in the MT file, so likely only variants (but from what model?).
Note: A combined ALL file is also delivered that has the three files mashed up together.
Sample (Geno2.0 Autosomal and X single file):
SNP,Chr,Allele1,Allele2
kgp10004422,12,A,G
kgp10025979,7,C,C
kgp22732377,X,A,A
kgp22734373,X,C,C
rs10000081,4,T,T
rs10000092,4,T,T
rs1000014,16,G,G
Sample (Geno2.0 Y file):
SNP,Chr,Allele1,Allele2
CTS100,Y,C,C
CTS10004,Y,G,G
Sample (Geno2.0 mt File):
SNP,Chr,Allele1,Allele2
73,Mt,A,A
195,Mt,A,A
225,Mt,A,A
NGG Geno2.0+ (NextGen)
Comma-separated list. First row is the header titles. The Autosomal and X files have rsIDs and positions like all the others, unlike NGG Geno2.0. In one sample, there were 698,194 rows in the Autosomal file, 17,813 in X, 13,534 in Y, and xx in M (only a simple list of derived-value SNPs; not all tested). Typical pair of values: two from A, T, C, or G, along with I (Insert), D (Delete), and '--' (no call). The Y file is like the older Geno2.0 and has SNP names and no coordinates. Aliases for some SNPs are given by an underscore in the name.
Note: it is not clear if this is always the case, but the files we saw anecdotally are sorted by whole line and not by specific columns. As SNP names come first, there is an alphabetic sort on them, with chromosomes totally intermixed. A combined ALL file is also delivered that has the three files mashed up together.
Sample (Geno2.0+, separate Autosomal and X files with same format):
RSID,CHROMOSOME,POSITION,RESULT
"rs3748597","1","878522","TC"
"rs13303106","1","881808","AA"
"rs28415373","1","883844","--"
"rs13303010","1","884436","AG"
Sample (Geno2.0+, Y file):
"SnpName","Chromosome","Result"
"CTS6704","Y","AA"
"CTS5286","Y","GG"
"BY1786","Y","GG"
"Y5543_Z20122","Y","CC"
"M3153_S7535","Y","AA"
"M245","Y","II"
Sample (Geno2.0+, mt file with ~40 entries; variants only):
"Chromosome","Position","Result"
"mt","2885","T"
"mt","16230","A"
"mt","11719","G"